Indexing and search system and method with add-on request, indexing and search engines

ABSTRACT

The indexing and search system comprises means (10) for storing an indexing base (24), means (22) for indexing resources (18) to create and update the indexing base (24), means (40) for searching for resources and adapted to interrogate the indexing base (24) on the basis of a request, and request-extender means (38) for obtaining an extended request on the basis of an initial request (34) formulated by a user and including initial terms (R₁), by adding to said initial request (34) terms which are neighbors to the initial terms. The extender means (38) further comprise means (36) for limiting the extension of the initial request by adding thereto only terms that are neighbors to initial terms that are not general, i.e. terms that do not have too great a number of neighbors. Means (20) for generalizing indexing may also be implemented in the invention.

The present invention relates to an indexing and search system.

More precisely, the invention relates to an indexing and search system of the type comprising means for storing an indexing base, means for indexing resources to create and update the indexing base, means for searching for resources and adapted to interrogate the indexing base on the basis of a request, and request-extender means for obtaining an extended request on the basis of an initial request formulated by a user and including initial terms, by adding to said initial request terms which are neighbors to the initial terms.

The invention also relates to a method of indexing and to a method of searching implemented by the system, and also to indexing and search engines.

In general, indexing and search systems include a semantic knowledge base containing a set of terms, each term possibly being associated with other terms in the same base which are semantically close thereto. Thus, when a user formulates a request in order to obtain in return pertinent documents that have been indexed by the indexing means, the search means enrich the initial request as formulated by the user with terms extracted from the knowledge base and which are semantically close to the initial terms of the request. This extension of the initial request by adding new terms that are neighbors to the initial terms can be reiterated. As a result, the search for documents is undertaken on the basis of an extended request having a larger number of terms than the initial request.

However, amongst the terms in the semantic knowledge base, some terms have a large number of neighboring terms, because they are very general. Thus, if a request includes any such general terms, when the request is extended there is a risk that it will end up having too great a number of terms and the search for documents runs the risk of being relatively ineffective and of consuming a large amount of time.

To mitigate that problem, certain indexing and search systems impose a predetermined maximum number of terms on the extended request. Those search and indexing systems stop extending a request once the maximum is reached, which means that the terms selected for the extended request are arbitrary. The search for documents then consumes less time, but to the detriment of pertinence.

The invention seeks to remedy the drawbacks of the above-mentioned conventional indexing and search systems, by providing a system that enables initial requests to be extended while still maintaining the effectiveness of the search for documents.

The invention thus provides an indexing and search system of the above-mentioned type, characterized in that the extender means include means for limiting the extension of the initial request by adding thereto only terms that are neighbors of initial terms that are not general, i.e. terms that do not have too large a number of neighboring terms.

Thus, an indexing and search system of the invention enables the extension of the initial request to be limited in pertinent manner, i.e. by encouraging extension from precise terms rather than from general terms.

An indexing and search system of the invention may further include one or more of the following characteristics:

-   it includes means for extracting terms from each resource, and means for generalizing the indexing of said resource by adding to the extracted terms, general terms that are neighbors thereto;
-   the request-extender means include means for generalizing the initial request by adding to the initial terms of the request, general terms that are neighbors thereto;
-   the extender means comprise a semantic knowledge base containing a set of terms within which the initial terms of the request can be found, each term being optionally associated with a list of at least one neighboring term taken from said semantic knowledge base;
-   a term of the semantic knowledge base is a general term if it is associated with a list containing a number of neighboring terms that is greater than a predetermined threshold;
-   the system includes means for generating a limitation knowledge base and a generalization knowledge base from the semantic knowledge base, the limitation knowledge base being associated with the means for limiting extension and the generalization knowledge base being independent of the limitation knowledge base and being associated with the means for generalizing the initial request;
-   the limitation knowledge base contains all of the terms of the semantic knowledge base, and its terms that correspond to general terms of the semantic knowledge base are not associated with any list of neighboring terms; and
-   the generalization knowledge base contains all of the terms of the semantic knowledge base, and the lists of neighboring terms that it contains comprise only those terms that correspond to general terms of the semantic knowledge base.

The invention also provides a method of searching indexed resources, the method comprising the following steps:

-   issuing an initial request formulated by a user and including initial terms;
-   extending the initial request by adding to said initial request terms that are neighbors to the initial terms;

the method being characterized in that the extension step includes a sub step of extending the initial request by adding thereto only terms that are neighbors to initial terms that are not general, i.e. initial terms that do not have too great a number of neighboring terms.

A method of searching indexed resources in accordance with the invention may further include the characteristic whereby the extension step includes a sub step of generalizing the initial request by adding to the initial terms of the request general terms that are neighbors thereto.

The invention also provides a method of indexing resources including a step of extracting terms from each resource, the method being characterized in that it further includes a step of generalizing the indexing of said resource by adding to said extracted terms general terms that are neighbors thereto.

The invention also provides an engine for indexing resources, the engine including means for extracting terms from each resource and being characterized in that it includes means for generalizing the indexing of said resource by adding to the extracted terms general terms that are neighbors thereto.

Finally, the invention also provides an engine for searching indexed resources, the engine including means for extracting initial terms from an initial request formulated by a user, means for searching the resources and adapted to interrogate an indexing base on the basis of a request, and request-extender means for obtaining an extended request from the initial request, the engine being characterized in that the extender means comprise means for limiting the extension of the initial request by adding thereto only terms that are neighbors to initial terms that are not general, i.e. terms that do not have too great a number of neighboring terms.

A search engine of the invention may further include the characteristic whereby the extender means include means for generalizing the initial request by adding to the initial terms of the request, general terms that are neighbors thereto.

The invention will be better understood from the following description given purely by way of example and made with reference to the accompanying drawings, in which:

FIG. 1 is a diagram of the general structure of an indexing and searching system of the invention; and

FIGS. 2 and 3 show the structure of the knowledge bases of the indexing and search system shown in FIG. 1, in two distinct embodiments.

The indexing and search system shown in FIG. 1 comprises storage means 10. It further comprises an indexing engine 12 and a search engine 14, both connected to the storage means 10.

The indexing engine 12 includes term-extractor means 16 receiving a document resource 18 as input from any document base accessible, e.g., via the Internet. By a known method of extraction, the means 16 supply terms T₁, T₂ that are extracted automatically from the document 18 and that are representative thereof. Each term extracted from the document 18 is forwarded to indexing-extender means 20 a.

The indexing-extender means 20 a supply, as output, the terms T₁ and T₂ associated with terms that are neighbors to T₁ and T₂ and that are taken from the storage means 10. For example, they supply a term T₃ that is semantically neighboring to the term T₁. They transmit the terms T₁, T₂, and T₃ to indexing means 22.

A reference D₁ for the document 18 is also transmitted to the indexing means 22. Finally, the extractor means 16 also transmit data to the indexing means 22 specifying the respective positions P₁ and P₂ of the extracted terms T₁ and T₂ in the document 18. The function of the indexing means 22 is to transfer all of this data to the storage means 10.

For this purpose, the storage means 10 include an indexing base 24. The indexing base 24 is made up of triplets each comprising a term, a reference to a document from which the term has been extracted, and the position of the term in that document. Thus, in the example given above, the indexing base contains a first triplet (T₁, D₁, P₁), a second triplet (T₂, D₁, P₂), and a third triplet (T₃, D₁, P₁). It should be observed that the term T₃ which is derived from T₁ is associated with the position P₁ of T₁ in D₁.
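By way of illustration only, the triplet structure of the indexing base 24 can be sketched as follows. The names `Entry` and `index_document`, the use of a dictionary for the generalization data, and the integer positions standing in for P₁ and P₂ are assumptions made for the sketch and are not taken from the description.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    """One triplet of the indexing base: (term, document reference, position)."""
    term: str
    doc: str
    position: int


def index_document(doc_ref, extracted, generalization_kb):
    """Build triplets for one document.

    extracted: list of (term, position) pairs supplied by the extractor.
    generalization_kb: hypothetical mapping term -> list of general neighbors.
    A term added by generalization inherits the position of the extracted
    term from which it is derived.
    """
    entries = []
    for term, position in extracted:
        entries.append(Entry(term, doc_ref, position))
        for general in generalization_kb.get(term, []):
            entries.append(Entry(general, doc_ref, position))
    return entries


# T3 is a neighbor of T1, so it is stored with the position of T1.
entries = index_document("D1", [("T1", 1), ("T2", 2)], {"T1": ["T3"]})
```

In this sketch the derived term T₃ is stored with the same document reference and position as T₁, mirroring the third triplet of the example.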

The storage means 10 also include a semantic knowledge base 26 comprising a set of terms. The terms contained in this semantic knowledge base 26 represent all of the terms recognized by the indexing and search system, and they include in particular the terms T₁, T₂, and T₃.

Optionally, each term in the semantic knowledge base 26 is associated with a list of at least one semantically neighboring term taken from the same knowledge base 26.

The storage means 10 also include two distinct knowledge bases 28 and 30 constructed from the semantic knowledge base 26.

The first of these two distinct knowledge bases is a limitation knowledge base 28 which contains the same terms as the knowledge base 26. However, its terms that correspond to general terms of the knowledge base 26 are not associated with any list of neighboring terms, unlike the corresponding general terms of the semantic knowledge base 26.

The second knowledge base is a generalization knowledge base 30 which contains all of the terms of the knowledge base 26. The lists of neighboring terms that it contains comprise only terms corresponding to general terms of the knowledge base 26.

The knowledge base 26 is useful for generating the limitation and generalization knowledge bases 28 and 30, but it is not used by the indexing and search system. Its presence in the storage means 10 is therefore not necessary to enable the indexing and search system to operate. It is necessary solely for updating the knowledge bases 28 and 30 whenever the set of stored terms is modified.

The indexing-extender means 20 a are connected to read the generalization knowledge base 30. Thus, when the indexing-extender means 20 a receive a term input thereto, they output that term together with general terms taken from the list of terms that are neighbors to the term that has been received as input, which list is provided by the generalization knowledge base 30. The unit constituted by the indexing-extender means 20 a and by the generalization knowledge base 30 thus forms indexing generalization means 20.

The search engine 14 includes term-extractor means 32 for extracting terms from an initial request 34 formulated by a user.

These extractor means 32 receive as input a request 34 as formulated by the user, and they output a list of terms extracted from said request and contained in the knowledge base 26, such as the term R₁.

This list of terms is supplied to first request-extender means 35 a. Like the indexing-extender means 20 a, the first request-extender means 35 a are connected to read the generalization knowledge base 30 and to co-operate therewith to form means 35 for generalizing the initial request 34. The first request-extender means 35 a output the term R₁ together with terms R₂ and R₃ belonging to the list of neighboring terms associated with the term R₁ in the generalization knowledge base 30.

The terms R₁, R₂, and R₃ are supplied as inputs to second request-extender means 36 a. These second request-extender means 36 a are identical to the first request-extender means 35 a, but they are connected to read the limitation knowledge base 28. As mentioned above, the general terms of the knowledge base 28 are not associated with any list of neighboring terms. Thus, the second request-extender means 36 a in association with the limitation knowledge base 28 form means 36 for limiting request extension. These means output an extended request constituted by the terms R₁, R₂, and R₃, and also a term R₄ supplied by the limitation knowledge base 28.

The generalization means 35 and the extension limitation means 36, possibly together with the knowledge base 28, constitute means 38 for extending the initial request. These means may be activated several times in an iterative process in order to extend the initial request progressively and output a final request which is transmitted to the search means 40.

The search means 40 are connected to the indexing base 24 of the storage means 10 and in response to the initial request 34 formulated by the user they supply a set 42 of document resources selected as a function of the terms R₁, R₂, R₃, and R₄ of the extended request.
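The chain formed by the generalization means 35, the limitation means 36, and the search means 40 can be sketched as below. This is a minimal sketch under stated assumptions: both knowledge bases are represented as dictionaries mapping a term to its list of neighboring terms, the indexing base is a list of (term, document, position) triplets, and the function names `extend_request` and `search` are hypothetical.

```python
def extend_request(initial_terms, generalization_kb, limitation_kb, rounds=1):
    """Extend a request: generalize first, then apply the limited extension.

    Both knowledge bases are assumed to be dicts term -> list of neighbors.
    General terms have no list in the limitation base, so they contribute
    nothing at the second stage. `rounds` models the optional iteration.
    """
    terms = list(dict.fromkeys(initial_terms))  # de-duplicate, keep order
    for _ in range(rounds):
        for kb in (generalization_kb, limitation_kb):
            added = [n for t in terms for n in kb.get(t, [])]
            terms = list(dict.fromkeys(terms + added))
    return terms


def search(index, request_terms):
    """Return the documents having at least one triplet whose term is requested."""
    wanted = set(request_terms)
    return sorted({doc for term, doc, _pos in index if term in wanted})


generalization_kb = {"R1": ["R2", "R3"]}  # general terms neighboring R1
limitation_kb = {"R1": ["R4"]}            # R2 and R3 are general: no list
index = [("R4", "D1", 7), ("R2", "D2", 3), ("X", "D3", 1)]

extended = extend_request(["R1"], generalization_kb, limitation_kb)
results = search(index, extended)
```

With these hypothetical data the extended request contains R₁, R₂, R₃, and R₄, and only the documents indexed under one of those terms are returned.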

A first implementation of the knowledge base 26 is shown in FIG. 2 in graphical form.

In this figure, the graphs comprise nodes such as nodes A, B, C, D, E, F, and G, each representing a term of the knowledge base. The nodes are optionally connected together by oriented arcs representing semantic links meaning “has as a directly-neighboring term”. Thus, term A has term B as a direct neighbor.

It can be considered that a term Y is a neighbor of a term X if there exists a path of no more than two oriented arcs from X to Y. Thus, term B has the term E as a direct neighbor. Term E is thus a neighbor of the term A.

It may also be considered that a term of the knowledge base 26 is a general term if it has at least five direct neighbors.
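The two definitions above (a neighbor is reachable by at most two oriented arcs; a general term has at least five direct neighbors) can be sketched as follows. The graph used here is a hypothetical example and is not a transcription of FIG. 2; the function names and the adjacency-dictionary representation are likewise assumptions.

```python
def neighbors(graph, term, max_arcs=2):
    """Terms reachable from `term` by a path of at most `max_arcs` oriented arcs."""
    found = set()
    frontier = {term}
    for _ in range(max_arcs):
        # Follow one more arc from every term reached in the previous step.
        frontier = {n for t in frontier for n in graph.get(t, [])} - found - {term}
        found |= frontier
    return found


def is_general(graph, term, threshold=5):
    """A term is general if it has at least `threshold` direct neighbors."""
    return len(graph.get(term, [])) >= threshold


# Hypothetical oriented graph: term -> list of direct neighbors.
graph = {
    "A": ["B", "C", "H", "I", "J", "K"],
    "B": ["E"],
    "C": ["F", "G"],
    "D": ["A"],
}

a_neighbors = neighbors(graph, "A")
```

Here E is a neighbor of A through B, whereas D is not a neighbor of A because the arc is oriented from D to A.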

In the example shown, only term A is a general term. It has six direct neighbors, including B and C. Term B has term F as its only direct neighbor. Term C has three direct neighbors B, F, and G. The terms B, C, E, F, and G are thus terms that are neighbors to term A.

Term C has four neighbors, B, E, F, and G. Term B has three neighbors D, E, and F. Term D has six neighbors including A and C, and term E has two neighbors, D and A. Terms F and G do not have any neighbors.

In the limitation knowledge base 28, the general term A has no direct neighbor since it is a general term in the knowledge base 26. However, all of the other terms have the same direct neighbors as in the knowledge base 26. That is to say only those oriented arcs that have A as their origin are omitted from the limitation knowledge base 28.

The generalization knowledge base 30 also has the same terms as the knowledge base 26. However the direct neighbors of a term in this base comprise all of the terms corresponding to general terms in the knowledge base 26 to which said term is a neighbor in said initial base. Thus, in the generalization knowledge base 30, only term A, which is the only general term in the knowledge base 26, is the direct neighbor of any other terms. In particular, it is the direct neighbor of terms B, C, E, F, and G which are its neighbors in the initial knowledge base, but it is not the direct neighbor of term D which does not belong to its neighborhood in the knowledge base 26.
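The derivation of the two bases from the semantic knowledge base, as described for this first embodiment, can be sketched as follows. The graph is again hypothetical rather than the graph of FIG. 2, and the function names, the dictionary representation, and the default threshold of five are assumptions made for the sketch.

```python
def neighbors(graph, term, max_arcs=2):
    """Terms reachable from `term` by at most `max_arcs` oriented arcs."""
    found, frontier = set(), {term}
    for _ in range(max_arcs):
        frontier = {n for t in frontier for n in graph.get(t, [])} - found - {term}
        found |= frontier
    return found


def build_knowledge_bases(semantic_kb, threshold=5):
    """Derive (limitation, generalization) bases from a semantic base.

    limitation: same terms; general terms lose their list of neighbors.
    generalization: same terms; each term lists only the general terms
    in whose neighborhood it lies in the semantic base.
    """
    terms = set(semantic_kb) | {n for ns in semantic_kb.values() for n in ns}
    general = {t for t in terms if len(semantic_kb.get(t, [])) >= threshold}
    limitation = {
        t: [] if t in general else list(semantic_kb.get(t, [])) for t in terms
    }
    generalization = {
        t: sorted(g for g in general if t in neighbors(semantic_kb, g))
        for t in terms
    }
    return limitation, generalization


semantic_kb = {
    "A": ["B", "C", "H", "I", "J", "K"],  # six direct neighbors: general
    "B": ["E"],
    "C": ["F", "G"],
    "D": ["A"],
}
limitation, generalization = build_knowledge_bases(semantic_kb)
```

In this sketch the arcs originating at A are dropped from the limitation base, while in the generalization base A becomes the sole direct neighbor of every term lying within two arcs of it.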

Thus, while indexing documents, such as the document 18, the generalization knowledge base 30 supplies the means 20 a with general terms that are neighbors to the terms extracted from the documents 18.

However, while extending a request, the limitation knowledge base 28 does not supply the second request-extender means 36 a with terms that are neighbors to general terms in the request, since the corresponding oriented arcs have been omitted. This would be pointless, since documents containing terms in the semantic neighborhood of general terms in the request have already been indexed with said general terms by the indexing generalization means 20.

The second embodiment shown in FIG. 3 differs from the first embodiment by the way in which the limitation knowledge base 28 and the generalization knowledge base 30 are generated from the knowledge base 26.

This embodiment makes it possible to introduce the notion of the distance between a document and the terms used to index it, by creating artificial terms. Thus, in the limitation knowledge base 28, each term corresponding to a general term of the knowledge base 26 is represented by a plurality of terms, all of which except one are artificial terms. The real instance of a general term has in its direct neighborhood only the set of general artificial instances. All of the other terms of the limitation knowledge base 28 have the same semantic neighborhood as the corresponding terms in the knowledge base 26.

Finally, the distances between real instances of general terms and each corresponding artificial instance are defined.

In the generalization knowledge base 30, the only terms which have a direct neighbor are terms which, in the initial knowledge base, form part of the neighborhood of a general term.

The semantic neighborhood of a term in the generalization knowledge base 30 comprises all of the general terms of which it forms a part of the semantic neighborhood in the knowledge base 26, but each of these general terms is represented in the neighborhood by its real instance or by an artificial instance, as a function of the distance between said general term and the term under consideration.

Thus, as shown in FIG. 3, in the generalization knowledge base 30, the terms B and C have as neighbors the real instance of the general term A, whereas terms E, F, and G, which are not direct neighbors of the general term A, are neighbors of the artificial instance of A.

By means of this embodiment, a request having the general term A only will enable a documentary resource having term B only to be found with a level of pertinence that is greater than a document resource that includes term E only.

The extension of the request including the general term A to a request including the general term A and its artificial instance makes it possible to find the second document, but with a level of pertinence that is lower than the first document, because of the distance between the general term A and its artificial instance in the limitation knowledge base 28.
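The ranking effect described for this second embodiment can be sketched as follows. The description does not specify how a distance is converted into a level of pertinence, so the weighting 1 / (1 + distance), the label "A#1" for the artificial instance, the distance value of 1.0, and all function names are assumptions chosen purely for illustration.

```python
def extend_with_weights(request_terms, limitation_kb):
    """Extend a request to artificial instances and weight them by distance.

    limitation_kb: hypothetical mapping from the real instance of a general
    term to a list of (artificial instance, distance) pairs.
    The weight 1 / (1 + distance) is an assumed formula.
    """
    weights = {t: 1.0 for t in request_terms}
    for t in request_terms:
        for instance, distance in limitation_kb.get(t, []):
            weight = 1.0 / (1.0 + distance)
            weights[instance] = max(weights.get(instance, 0.0), weight)
    return weights


def pertinence(doc_terms, weights):
    """Score a document by its best-weighted indexed term (assumed scoring)."""
    return max((weights.get(t, 0.0) for t in doc_terms), default=0.0)


# Real instance "A" is linked to one artificial instance at distance 1.0.
limitation_kb = {"A": [("A#1", 1.0)]}

# Indexing with generalization: a document containing B is also indexed
# under the real instance of A; a document containing E is indexed under
# the artificial instance.
index = {"doc_B": {"B", "A"}, "doc_E": {"E", "A#1"}}

weights = extend_with_weights(["A"], limitation_kb)
ranking = sorted(index, key=lambda d: pertinence(index[d], weights), reverse=True)
```

Under these assumptions both documents are found by a request for A alone, with the document containing B ranked ahead of the document containing E.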

It can clearly be seen that an indexing and search system with request extension in accordance with the invention makes it possible to optimize searching for document resources by controlling the extent to which a request is extended.

Nevertheless, it should be observed that the invention is not limited to the embodiment described above.

In a variant, the storage means 10 need not include a limitation knowledge base 28 and a generalization knowledge base 30 generated from the knowledge base 26.

Under such circumstances, the indexing generalization means 20 are fully integrated in the indexing engine 12 and are connected to read the knowledge base 26. They then include means for extracting only general terms from the knowledge base 26, including the terms which are neighbors to the terms supplied thereto as inputs.

Similarly, under such circumstances, the request generalization means 35 are fully integrated in the search engine 14 and are identical to the indexing generalization means 20.

Finally, likewise under such circumstances, the extension limiting means 36 are fully integrated in the search engine 14 and are connected to read the knowledge base 26. They are adapted to add to the terms supplied thereto, only terms which are neighbors to initial terms that are not general in the knowledge base 26.
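In this variant the limitation can be computed on the fly from the knowledge base 26 rather than read from a derived base. The sketch below is one possible reading, assuming a dictionary representation, a threshold of five direct neighbors, and extension by direct neighbors only for brevity; the function names are hypothetical.

```python
def is_general(kb, term, threshold=5):
    """A term is general if it has at least `threshold` direct neighbors."""
    return len(kb.get(term, [])) >= threshold


def extend_limited(terms, kb, threshold=5):
    """Add only neighbors of the supplied terms that are not general.

    The generality test is made directly against the semantic base, so no
    separate limitation base is needed. Only direct neighbors are added
    here, which is a simplification.
    """
    extended = list(terms)
    for t in terms:
        if is_general(kb, t, threshold):
            continue  # general term: contributes no neighbors
        for n in kb.get(t, []):
            if n not in extended:
                extended.append(n)
    return extended


# Hypothetical semantic base: A is general (six direct neighbors), B is not.
kb = {"A": ["B", "C", "H", "I", "J", "K"], "B": ["E"], "C": ["F", "G"]}
extended = extend_limited(["A", "B"], kb)
```

The general term A is kept in the request but contributes no neighbors, while the precise term B contributes E.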

1-14. (canceled)
 15. An indexing and search system comprising: a) means for storing an indexing base; b) means for indexing resources to create and update the indexing base; c) means for searching for resources and adapted to interrogate the indexing base with a request; and d) request-extender means for obtaining an extended request with an initial request formulated by a user and including initial terms (R₁), by adding to the initial request terms which are neighbors to the initial terms and the extender means including means for limiting the extension of the initial request by adding thereto only terms that are neighbors of initial terms and that are not general.
 16. The indexing and search system of claim 15, wherein the system includes means for extracting terms (T₁, T₂) from each resource, and means for generalizing the indexing of the resource by adding to the extracted terms, general terms (T₃) that are neighbors thereto.
 17. The indexing and search system of claim 15, wherein the request-extender means include means for generalizing the initial request by adding to the initial terms of the request, general terms that are neighbors thereto.
 18. The indexing and search system of claim 15, wherein the extender means comprises a semantic knowledge base containing a set of terms (T₁, T₂, T₃, R₁, R₂, R₃, R₄; A, B, C, D, E, F, G) within which the initial terms (R₁) of the request can be found, each term being optionally associated with a list of at least one neighboring term taken from the semantic knowledge base.
 19. The indexing and search system of claim 18, wherein a term (T₁, T₂, T₃, R₁, R₂, R₃, R₄; A, B, C, D, E, F, G) of the semantic knowledge base is a general term associated with a list containing a number of neighboring terms that is greater than a predetermined threshold.
 20. The indexing and search system of claim 18, wherein the system includes means for generating a limitation knowledge base and a generalization knowledge base from the semantic knowledge base, the limitation knowledge base being associated with the means for limiting extension and the generalization knowledge base being independent of the limitation knowledge base and being associated with the means for generalizing the initial request.
 21. The indexing and search means of claim 20, wherein the limitation knowledge base contains all terms of the semantic knowledge base and the terms correspond to general terms of the semantic knowledge base that are not associated with any list of neighboring terms.
 22. The indexing and search system of claim 20, wherein the generalization knowledge base contains all terms of the semantic knowledge base, and the lists of neighboring terms that the generalization knowledge base contains comprise only those terms that correspond to general terms of the semantic knowledge base.
 23. A method of searching indexed resources, the method comprising the following steps: a) issuing an initial request formulated by a user and including initial terms (R₁); b) extending the initial request by adding to the initial request terms that are neighbors to the initial terms (R₁) and includes a sub step of extending the initial request by adding thereto only terms (R₄) that are neighbors to initial terms that are not general.
 24. The method of searching indexed resources of claim 23, wherein the extension step includes a sub step of generalizing the initial request by adding to the initial terms of the request general terms (R₂, R₃) that are neighbors thereto.
 25. A method of indexing resources including a step of extracting terms (T₁, T₂) from each resource, and generalizing the indexing of the resource by adding to the extracted terms general terms (T₃) that are neighbors thereto.
 26. An engine for indexing resources, the engine including means for extracting terms from each resource and means for generalizing the indexing of the resource by adding to the extracted terms general terms that are neighbors thereto.
 27. An engine for searching indexed resources, the engine including means for extracting initial terms from an initial request formulated by a user, means for searching the resources and adapted to interrogate an indexing base on the basis of a request, and request-extender means for obtaining an extended request from the initial request, the extender means comprising means for limiting the extension of the initial request by adding thereto only terms that are neighbors to initial terms that are not general.
 28. The engine for searching indexed resources of claim 27, wherein the extender means include means for generalizing the initial request by adding to the initial terms of the request, general terms that are neighbors thereto.